Today's Menu¶

  • Audio Signal Processing
    • Theory
    • A practical example
  • Your next tasks
  • Presentations

Are we complete?¶

  • Team Task 1 (Ioan-Cristian, Dominik, Michael, Anna)?
  • Team Task 4 (Aleksandr, Fabian, Lara, Maximilian)?
  • Team Task 6 (Giovanni, Fabian, Youniss)?
  • Team AES: QVIM (Andrei, Anton Atanasov, Aleksandar)?

Tips & Tricks for building an ML pipeline [Wednesday, April 2]¶

Pipeline - Expectation¶

drawing

Pipeline - Reality¶

drawing

Best Practices for ML Pipelines¶

  • Document everything
    Your future self (and your teammates) should understand what you did, why you did it, and how it works.

  • Double-check your steps
    Small mistakes (e.g., data leakage, wrong labels, off-by-one errors) can silently propagate and waste hours later.

  • Log everything
    Track parameters, results, errors, and unexpected behaviors. Good logs = easy debugging.

  • Use version control
    Commit early, commit often. Track your code and configs to understand changes over time.

  • Automate where possible
    Make your pipeline reproducible. E.g., use scripts to run multiple commands.

  • Keep it simple (at first)
    Start with a minimal working version, then iterate. Don’t over-engineer early.

Documentation¶

  • Maintain a continuously updated work log
    Your colleagues (and future you) should be able to quickly understand what you've done and why.

  • A good work log saves time when writing your technical report
    You’ll thank yourself later when everything is already written down.

  • Store your work log in version control
    Or at least use a shared document if collaborating—keep everything accessible and trackable.

  • Collect questions for the teaching team in your work log
    This helps identify common issues and makes discussions more efficient.

  • Add comments to your code, focusing on why things are done
    Don't just write what the code does—explain the reasoning behind key decisions.

Version Control¶

  • Use Git — it's the standard tool for version control
    Learn the basics well (commits, branches, merges, resolving conflicts).

  • When collaborating, work on your own branches
    This avoids conflicts and allows for parallel development.

  • Merge frequently
    Don’t let branches drift too far apart — regular merges help catch issues early.

  • Keep the main/master branch stable and runnable
    Treat it as your working baseline; don’t break it.

  • Review new features as a group before merging
    Ensure everyone is on the same page and avoid unexpected issues in shared code.

Testing & Sanity Checks¶

Run small, regular tests to avoid wasting time debugging later:

  • Exploratory tests
    Try out unfamiliar libraries or functions in a Jupyter Notebook to quickly learn how they behave.

  • Sanity checks for your own code
    Test functions like data loading, augmentation, or feature extraction on a few samples. Print shapes, min/max, or visualize outputs.

  • Visual inspection
    Plot spectrograms, embeddings, or model outputs to ensure your pipeline behaves as expected.

  • Overfit a tiny batch
    A classic trick: your model should be able to overfit 1–2 training examples. If it can’t, something’s likely wrong.

  • Group review of key changes
    Before running full experiments, verify new code together — four eyes catch more bugs than two.
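
The "overfit a tiny batch" check can be sketched in a few lines. This is an illustrative stand-in (a plain linear model on random data), not your actual pipeline:

```python
import torch
import torch.nn.functional as F

# Sanity check sketch: if a model cannot drive the loss on 2 fixed
# samples close to zero, something in the pipeline is likely broken.
torch.manual_seed(0)
model = torch.nn.Linear(8, 2)        # placeholder for your real model
x = torch.randn(2, 8)                # a "tiny batch" of 2 examples
y = torch.tensor([0, 1])
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

initial_loss = F.cross_entropy(model(x), y).item()
for _ in range(200):                 # a few hundred steps suffice here
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

print(f"loss: {initial_loss:.3f} -> {loss.item():.3f}")
```

If the final loss refuses to drop, check labels, data loading, and the loss computation before scaling up.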

Reproducibility: What Can Break It?¶

  • Code & Dependencies

    • Changing library versions (e.g., PyTorch, NumPy) can change behavior
    • Solution: use virtual environments (e.g., conda, venv) and freeze dependencies (requirements.txt or environment.yml)
  • Training Pipeline

    • Pseudo-random number generators (PRNGs): Python, NumPy, PyTorch, etc.
    • Parallelism (e.g., multi-GPU, DataLoader workers) can introduce variability
    • Non-deterministic GPU ops (e.g., certain cuDNN kernels, FP16 ops)
  • Data Preprocessing

    • Augmentations, shuffling, random splits, and label generation must be deterministic
    • Even slight differences (e.g., rounding, file order) can change results

Reproducibility: How to Make It Deterministic?¶

import random

import numpy as np
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader

SEED = 42  # any fixed value

# Set all PRNGs
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Enforce deterministic ops (optional, slows things down)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

pl.seed_everything(SEED, workers=True) # Let Lightning seed everything, including workers

loader = DataLoader(
    dataset,
    shuffle=True,
    generator=torch.Generator().manual_seed(SEED), # Reproducible DataLoader shuffling
    ...
)

# Log everything needed to rerun the experiment
log = {"seed": SEED, "git_commit": ..., "torch_version": torch.__version__, ...}

Data Leakage¶

  • Always use the provided split for your task
  • Ensure that no information from the test set leaks into training or validation. This includes label distribution, normalization stats, data augmentations, etc.
  • Normalization leakage is a common pitfall: Don’t compute mean/std over the full dataset — only use training data stats.
  • Best practice: Treat your test set as untouchable — only access it once, at the very end, for final evaluation.
drawing
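
A minimal sketch of leakage-free normalization, using random arrays as stand-ins for real features: the mean and standard deviation come from the training split only and are then reused for the test split.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=2.0, scale=3.0, size=(1000, 64))  # dummy "features"
test = rng.normal(loc=2.0, scale=3.0, size=(200, 64))

mu, sigma = train.mean(), train.std()  # stats from TRAINING data only
train_norm = (train - mu) / sigma
test_norm = (test - mu) / sigma        # reuse the stats, never recompute on test

print(train_norm.mean(), train_norm.std())  # ~0 and ~1 by construction
```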

Leakage across devices¶

  • In DCASE Task 1, recording devices are a major source of domain shift.
  • Measuring generalization performance to unseen devices is crucial.
  • Splitting data without considering the recording device causes leakage and may inflate your results. Use the provided split, which accounts for device separation.
drawing

PyTorch Lightning: Minimal Interface¶

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self, n_classes=10):
        super().__init__()
        self.model = torch.nn.Linear(128, n_classes)  # example model
        self.validation_step_outputs = []

    def forward(self, x):
        return self.model(x)  # inference step

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        val_loss = F.cross_entropy(self(x), y)
        self.validation_step_outputs.append(...)  # e.g., predictions and targets
        return val_loss

    def on_validation_epoch_end(self):
        self.log("val/accuracy", ...)  # aggregate the collected outputs
        self.validation_step_outputs.clear()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == "__main__":
    model = MyModel(n_classes=10)
    train_loader, val_loader = ...  # define DataLoaders
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_loader, val_loader)

Weights and Biases: Minimal Interface¶

import argparse
import subprocess

import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

def main(config):
    wandb_logger = WandbLogger(
        project="my-project",
        name=config.experiment_name,
        config=vars(config)  # logs all argparse args as W&B config
    )
    
    trainer = pl.Trainer(
        max_epochs=config.n_epochs,
        logger=wandb_logger
    )
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--n_epochs", type=int, default=10)
    parser.add_argument("--experiment_name", type=str, default="minimal-wandb-run")
    args = parser.parse_args()

    try:
        commit_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("ascii").strip()
    except Exception:
        commit_hash = "unknown"

    args.commit_hash = commit_hash  # include in logged config

    # save library versions for reproducibility
    args.versions = {
        "torch": torch.__version__,
        ...
    }

    main(args)

Audio Signal Processing¶

Disclaimer:¶

The following section is a practical introduction to audio signal processing for deep learning. It is meant to help you:

  • Understand common steps and hyperparameters used in audio pipelines
  • Interpret preprocessing choices in research papers and codebases

What this is not:

  • A deep dive into the mathematical theory behind signal processing
  • A substitute for a full course on digital signal processing (DSP)

If you're curious to go deeper into the math, we highly recommend
📘 The Scientist and Engineer's Guide to Digital Signal Processing

Overview – Theory¶

Let's do a speedrun across the theoretical foundations of a typical audio signal processing pipeline:

  • Sound and Digital Audio Signals: What is sound, and how do we convert it into a digital signal?
  • Digital Filters: Tools to process the digital signal in time or frequency domain
  • Discrete Fourier Transform (DFT): Analyze a digital signal in terms of its frequency components
  • Magnitude Spectrum: Convert the complex spectrum into a magnitude spectrum
  • Frequency Resolution: Understand the frequency spacing and resolution of the spectrum
  • Spectral Leakage and Windowing: Use window functions to reduce spectral leakage
  • Short-Time Fourier Transform (STFT): Convert a time-domain signal into a time–frequency representation
  • Mel Transform: Compress the frequency axis based on human auditory perception
  • Logarithmic Compression of Amplitude: Compress the dynamic range → log mel spectrogram

An example¶

Our starting point: ... a dog barking

In [2]:
sr = 32000 
example_wav, _ = liro.load(example_file, sr=sr)
wav_plot(example_wav, sr)

Long Story Short¶

In [3]:
import torch
import torchaudio
from torchaudio import transforms

waveform, sample_rate = torchaudio.load(example_file)
transform = transforms.MelSpectrogram(sample_rate, n_fft=800, n_mels=80)
mel_specgram = transform(waveform)
mel_specgram = torch.log(mel_specgram + 1e-5)
plot_spectrogram(mel_specgram, waveform, sample_rate)

Sound and Digital Audio Signals¶

  • Sound: variation in air pressure at a point in space as a function of time
  • The microphone turns the mechanical energy of a soundwave into an analog electrical signal
  • To process it with a computer, we convert it into a digital signal, which involves:
    • Any ideas?

Sound and Digital Audio Signals¶

  • Sound: variation in air pressure at a point in space as a function of time
  • The microphone turns the mechanical energy of a soundwave into an analog electrical signal
  • To process it with a computer, we convert it into a digital signal, which involves:
    • Sampling: measuring the signal at regular time intervals (sampling rate, e.g., 44,100 times per second)
    • Quantization: rounding each sample to a fixed set of amplitude levels (bit depth, e.g., 16-bit resolution)
drawing
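
The two steps can be sketched with NumPy; the 8 kHz rate, 440 Hz tone, and 16-bit depth here are arbitrary example values:

```python
import numpy as np

sr = 8000                              # sampling rate: 8000 samples/second
n_samples = 80                         # 10 ms of audio
t = np.arange(n_samples) / sr          # sampling: regular time instants
x = 0.5 * np.sin(2 * np.pi * 440 * t)  # "analog" 440 Hz tone, sampled

bit_depth = 16
levels = 2 ** (bit_depth - 1)          # signed 16-bit: values in [-32768, 32767]
x_q = np.round(x * levels).astype(np.int16)  # quantization (amplitude 0.5, no clipping)

print(x_q.dtype, x_q.min(), x_q.max())
```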

The Sampling Theorem¶

  • If the signal is "properly sampled", it can be reconstructed exactly from its samples

  • A continuous (analog) signal is "properly sampled" if it contains no frequency components above half the sampling rate

    • This limit is called the Nyquist frequency
    • Example: for a sampling rate of 44,100 Hz → Nyquist frequency = 22,050 Hz
  • Frequencies above the Nyquist limit will be aliased, i.e., incorrectly folded into lower frequencies, resulting in distortion and irrecoverable information loss

  • To prevent aliasing: Apply an analog low-pass filter before digitizing the signal
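
Aliasing is easy to demonstrate numerically. Below, a 5 kHz tone is sampled at 8 kHz (Nyquist = 4 kHz), and its spectral peak shows up at the alias frequency 8000 - 5000 = 3000 Hz:

```python
import numpy as np

sr, f_true = 8000, 5000                # 5 kHz tone, but Nyquist is only 4 kHz
t = np.arange(sr) / sr                 # 1 second of samples
x = np.sin(2 * np.pi * f_true * t)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / sr)
f_peak = freqs[np.argmax(spectrum)]
print(f_peak)                          # ~3000 Hz: the tone was folded down
```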

The Sampling Theorem¶

drawing

Digital Filters¶

  • Digital filters are essential tools in audio processing
    → used to enhance, suppress, or separate signal components

  • Can operate in the time or frequency domain

  • Defined by their impulse response or frequency response

  • Two main types:

    • FIR (Finite Impulse Response) → implemented via convolution
    • IIR (Infinite Impulse Response) → uses recursion (feedback)

See the pre-emphasis filter in the example pipeline
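
As a toy FIR example (not the pre-emphasis filter itself), a 5-tap moving average acts as a crude low-pass, and applying it is literally a convolution with the filter's impulse response. At a 1 kHz sampling rate this particular filter happens to null a 200 Hz tone exactly, so the fast component vanishes from the output:

```python
import numpy as np

h = np.ones(5) / 5                         # impulse response: 5-tap moving average
fs = 1000
t = np.arange(fs) / fs                     # 1 s at 1 kHz
slow = np.sin(2 * np.pi * 2 * t)           # 2 Hz component (kept)
fast = 0.5 * np.sin(2 * np.pi * 200 * t)   # 200 Hz component (removed)
x = slow + fast

y = np.convolve(x, h, mode="same")         # FIR filtering = convolution
err = np.abs(y - slow)[10:-10].max()       # ignore boundary samples
print(err)                                 # tiny: the 200 Hz tone is gone
```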

Discrete Fourier Transform (DFT)¶

  • The DFT can be used to analyze digital signals in terms of their frequency components
  • It assumes the signal is finite, periodic, and repeats infinitely in both directions
  • The result is a set of complex coefficients, each representing a sinusoidal basis function:

$$ X[k] = \frac{1}{N} \sum_{n=0}^{N-1} x[n] \cdot e^{\frac{-2\pi i k n}{N}} $$

  • $X[k]$: complex amplitude of the $k$-th frequency bin
  • $N$: number of time samples
  • The exponent represents a complex sinusoid (a rotating vector)

Discrete Fourier Transform (DFT)¶

Using Euler’s formula:

$$ e^{ix} = \cos(x) + i \sin(x) $$

we can write:

$$ X[k] = \frac{1}{N} \sum_{n=0}^{N-1} \left[ x[n] \cdot \cos\left(\frac{-2\pi k n}{N}\right) + i \cdot x[n] \cdot \sin\left(\frac{-2\pi k n}{N}\right) \right] $$

  • The real part corresponds to how much of a cosine wave (frequency $k$) is present in the signal
  • The imaginary part corresponds to how much of a sine wave (same frequency $k$) is present

$\rightarrow$ The DFT tells us how much of each sine and cosine wave at frequency $k$ is needed to reconstruct the signal.
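
The formula can be checked directly against NumPy's FFT. Note the $\frac{1}{N}$ scaling used on this slide, which is why the FFT result is divided by $N$:

```python
import numpy as np

N = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(N)

n = np.arange(N)
k = n.reshape(-1, 1)
basis = np.exp(-2j * np.pi * k * n / N)   # complex sinusoids e^{-2*pi*i*k*n/N}
X = (basis @ x) / N                       # naive O(N^2) DFT with 1/N scaling

ok = np.allclose(X, np.fft.fft(x) / N)    # matches the FFT up to the scaling
print(ok)
```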

Discrete Fourier Transform (DFT)¶

Basis functions for 16-point DFT:

drawing

Discrete Fourier Transform (DFT)¶

  • The time-domain signal consists of $N$ samples: $x[0] \dots x[N-1]$
  • The DFT translates this into $\frac{N}{2} + 1$ sine (imaginary) and cosine (real) amplitudes
  • The DFT can be computed via:
    • Matrix multiplication with sine/cosine basis → $O(N^2)$
    • Or much faster using the Fast Fourier Transform (FFT) → $O(N \log N)$
drawing

Magnitude Spectrum¶

  • Each DFT coefficient $X[k]$ is a complex number with a real part (cosine component) and an imaginary part (sine component)

  • These can be converted to polar form:

    • Magnitude: $ |X[k]| = \sqrt{(\mathrm{Re}\,X[k])^2 + (\mathrm{Im}\,X[k])^2} $
    • Phase (angle): $ \phi[k] = \operatorname{atan2}(\mathrm{Im}\,X[k], \mathrm{Re}\,X[k]) $
drawing
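
For a single coefficient, the conversion to polar form matches NumPy's built-ins; the value 3 + 4i is just a convenient example:

```python
import numpy as np

X_k = 3.0 + 4.0j   # Re (cosine part) = 3, Im (sine part) = 4

magnitude = np.sqrt(X_k.real ** 2 + X_k.imag ** 2)
phase = np.arctan2(X_k.imag, X_k.real)

print(magnitude, phase)   # magnitude is 5.0 (the 3-4-5 triangle)
```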

Magnitude Spectrum¶

  • In many applications:

    • We use only the magnitude spectrum — or, more often, the power spectrum (squared magnitude)
    • The phase is often ignored — the human ear is mostly insensitive to phase, except in some edge cases (e.g., localization, transients)
  • The magnitude of bin $X[k]$ reflects how much energy is present in its frequency band

drawing

Frequency Resolution¶

  • An $N$-point FFT (typically with $N$ as a power of 2) produces $\frac{N}{2} + 1$ frequency bins for real-valued input
  • These bins are uniformly spaced from $0$ to $\frac{S}{2}$ Hz, where $S$ is the sampling rate
  • The frequency spacing between bins is: $\Delta f \approx \frac{S}{N}$ → This is often called the frequency resolution

Example:¶

  • Sampling rate: $S = 32,000$ Hz
  • FFT size: $N = 1024$

$$\Delta f \approx \frac{32000}{1024} \approx 31.25\ \text{Hz}$$

→ You get 513 bins, each representing a frequency band ~31.25 Hz wide, from 0 Hz up to 16,000 Hz
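
The bin layout of this example can be verified with `np.fft.rfftfreq`:

```python
import numpy as np

sr, n_fft = 32000, 1024
freqs = np.fft.rfftfreq(n_fft, d=1 / sr)   # center frequency of each bin

print(len(freqs))           # 513 bins
print(freqs[1] - freqs[0])  # ~31.25 Hz spacing
print(freqs[-1])            # ~16000 Hz (Nyquist)
```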

Frequency Resolution¶

  • To get more finely spaced bins in the frequency domain:
    • You can increase $N$ by zero-padding the signal (appending zeros)
    • This improves visual resolution, but not true spectral resolution
drawing

Spectral Leakage and Windowing¶

  • The DFT assumes the signal is periodic over the analysis window
    → It treats the time-domain signal as if it's infinitely repeated
drawing
  • If a sinusoid does not complete an integer number of cycles within the window:
    • It gets cut off at the edges
    • This introduces discontinuities at the window boundaries

Spectral Leakage and Windowing¶

  • These sharp edges cause spectral leakage:
    • Energy from a single frequency spreads into many DFT bins
    • This effect requires many basis functions to explain the edge artifacts
drawing
  • To reduce discontinuities at the window edges, we multiply the signal with a window function → This smoothly tapers the signal to zero at the edges
    → Prevents sharp "cuts" that cause spectral leakage

Spectral Leakage and Windowing¶

  • With proper windowing, sine waves that don’t perfectly align with DFT bins produce cleaner, more localized peaks

  • Common window functions:

    • Hann, Hamming, Blackman, Gaussian, etc.
  • There’s a trade-off between narrow peak and low side lobes

drawing
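
The effect can be measured directly: a sinusoid at 100.5 cycles per window (exactly between two DFT bins) is analyzed with a rectangular window and with a Hann window, and the leakage far away from the true frequency is compared. The signal parameters are illustrative choices:

```python
import numpy as np

N = 1024
t = np.arange(N) / N
x = np.sin(2 * np.pi * 100.5 * t)          # 100.5 cycles: worst-case off-bin

spec_rect = np.abs(np.fft.rfft(x))                  # no window (rectangular)
spec_hann = np.abs(np.fft.rfft(x * np.hanning(N)))  # Hann-windowed

far = np.r_[0:50, 150:len(spec_rect)]      # bins far from the true frequency
print(spec_rect[far].max(), spec_hann[far].max())  # Hann leaks far less
```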

Short-Time Fourier Transform (STFT)¶

  • A regular DFT tells us nothing about when things happen
  • The Short-Time Fourier Transform (STFT) computes a time–frequency representation (spectrogram) by:
    • Splitting the signal into short, overlapping windows
    • Computing the DFT separately for each window
drawing
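
A quick shape check with `torch.stft`, using the window and hop sizes that appear later in this notebook (800-sample window, 320-sample hop, 1024-point FFT) on 10 s of dummy audio at 32 kHz:

```python
import torch

sr = 32000
n_fft, win_length, hop_length = 1024, 800, 320
x = torch.randn(1, sr * 10)                # 10 s of dummy audio

spec = torch.stft(x, n_fft=n_fft, hop_length=hop_length,
                  win_length=win_length,
                  window=torch.hann_window(win_length),
                  return_complex=True)
# with the default center=True: 1 + n_samples // hop_length frames
print(spec.shape)                          # (1, 513, 1001)
```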

Short-Time Fourier Transform (STFT)¶

  • Choose your window length wisely:
    • Why?
drawing

Short-Time Fourier Transform (STFT)¶

  • Choose your window length wisely:
    • Long window → better frequency resolution, worse time resolution
    • Short window → better time resolution, worse frequency resolution
    • Use a length where the signal is approximately stationary within a window
drawing

Mel Transform¶

  • After computing the STFT, we usually apply a Mel filterbank to transform the linear frequency spectrogram into a perceptually motivated, compressed representation

  • Motivation from human perception: The human ear does not perceive frequency linearly. We are more sensitive to changes in low frequencies than in high frequencies.

  • This can be implemented as a matrix multiplication:
    $$ \mathrm{mel\_spec} = \mathrm{torch.matmul}(\mathrm{mel\_filterbank}, \mathrm{spectrogram}) $$

  • Purpose:

    • Reduce dimensionality of the spectrogram
    • Emphasize features most relevant to human hearing
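
The frequency warping behind the mel axis can be illustrated with the HTK-style conversion formula (one common convention, $m = 2595 \log_{10}(1 + f/700)$); note how equally spaced mel points crowd together at low frequencies:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # HTK-style mel scale

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 6 equally spaced points on the mel axis, mapped back to Hz
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(16000.0), 6)
edges_hz = mel_to_hz(edges_mel)
print(np.round(edges_hz))   # small steps at low f, large steps at high f
```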

Mel Filterbank¶

drawing

Logarithmic Compression of Amplitude¶

  • The human ear does not perceive loudness linearly:

    • A 10× increase in sound power results in approximately a 2× increase in perceived loudness
    • This follows a power-law relationship: $ \text{Loudness} \propto \text{Power}^n $, with $ n \approx 0.3 $
    • To handle the wide dynamic range of real-world sounds, we use a logarithmic scale — the decibel (dB) scale — which aligns well with human perception
  • To convert sound power $P$ to decibels: $$ \text{dB} = 10 \cdot \log_{10}\left(\frac{P}{P_{\text{ref}}}\right) $$
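
The formula in code form, checking a few reference points:

```python
import numpy as np

def power_to_db(p, p_ref=1.0):
    return 10.0 * np.log10(p / p_ref)   # the dB formula from above

print(power_to_db(1.0))    # 0 dB: power equals the reference
print(power_to_db(10.0))   # +10 dB per factor of 10 in power
print(power_to_db(2.0))    # ~+3 dB: doubling the power
```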

Overview - Practical Example¶

Let's look at a typical preprocessing pipeline (see example pipeline on GitHub):

  • Pre-emphasis filter: a simple FIR filter applied in the time domain to amplify high frequencies and flatten the spectrum
  • Short-Time Fourier Transform (STFT)
  • Power spectrogram: compute the squared magnitude of the complex STFT
  • Mel Transform
  • Logarithmic Amplitude compression → log mel spectrogram

An Example: from the waveform to the log mel spectrogram¶

Our starting point: ... a dog barking

In [6]:
sr = 32000 
example_wav, _ = liro.load(example_file, sr=sr)
wav_plot(example_wav, sr)

Apply Pre-emphasis (Digital Filter)¶

  • In natural audio signals, low frequencies tend to dominate, with energy typically dropping ~2 dB per kHz

  • This spectral imbalance can mask important details in higher frequencies

  • A pre-emphasis filter is a simple FIR (Finite Impulse Response) filter applied in the time domain to:

    • Flatten the spectral envelope
    • Boost higher frequencies

Apply Pre-emphasis (Digital Filter)¶

In [7]:
preemphasis_coefficient = torch.as_tensor([[[-.97, 1]]])
wav_torch = torch.from_numpy(example_wav)
wav_pree = nn.functional.conv1d(wav_torch.reshape(1, 1, -1), preemphasis_coefficient).squeeze(1)
In [8]:
freq_plot(preemphasis_coefficient.squeeze().numpy(), sr, title="Pre-emphasis Filter Frequency Magnitude Response")
In [9]:
spec_liro(example_wav, sr, title="Log Spectrogram (without Pre-emphasis)")
spec_liro(wav_pree.squeeze().numpy(), sr, title="Log Spectrogram (with Pre-emphasis)")

STFT¶

In [10]:
n_fft, win_length, hop_length = 1024, 800, 320 
window = torch.hann_window(win_length)
spec = torch.stft(wav_pree, n_fft=n_fft, hop_length=hop_length,
                  win_length=win_length, window=window, 
                  return_complex=True)
print("Complex spec shape: ", spec.shape)
spec = torch.view_as_real(spec)
print("Real spec shape: ", spec.shape)
power_spec = (spec ** 2).sum(dim=-1)
# for comparison, we also compute the magnitude spectrogram
mag_spec = torch.sqrt(power_spec)
Complex spec shape:  torch.Size([1, 513, 1000])
Real spec shape:  torch.Size([1, 513, 1000, 2])

STFT Window¶

In [11]:
wav_plot(window.squeeze().numpy(), sr, listen=False, title="Hann window (time-domain)")
In [12]:
spec_liro(mag_spec.squeeze().numpy(), sr, x_is_spec=True, convert_to_db=False, title="Magnitude Spectrogram")
spec_liro(power_spec.squeeze().numpy(), sr, x_is_power_spec=True, convert_to_db=False, title="Power Spectrogram")

Mel Transformation¶

In [13]:
n_mels, fmin, fmax = 40, 0.0, sr // 2
mel_basis, _ = torchaudio.compliance.kaldi.get_mel_banks(n_mels, n_fft, sr,
                                                         fmin, fmax, 
                                                         vtln_low=100.0,
                                                         vtln_high=-500.,
                                                         vtln_warp_factor=1.0)
# pad with one zero per mel bin to match n_fft // 2 + 1
mel_basis = torch.as_tensor(torch.nn.functional.pad(
    mel_basis, (0, 1), mode='constant', value=0)
)

print(mel_basis.shape)
torch.Size([40, 513])
In [14]:
fig, ax = plt.subplots(nrows=1, figsize=(10, 4))
ax.set_title("Mel filterbank")
ax.set_xlabel("FFT bin index")
ax.set_ylabel("Mel bin")
ax.imshow(mel_basis.squeeze().numpy(), cmap='hot', interpolation='nearest', aspect='auto')
plt.show()
In [15]:
melspec = torch.matmul(mel_basis, power_spec)
spec_liro(power_spec.squeeze().numpy(), sr, x_is_power_spec=True, title="Log Power Spectrogram")
spec_liro(melspec.squeeze().numpy(), sr, x_is_mel_spec=True, title="Log Mel Spectrogram")

Log the Amplitude¶

In [16]:
log_mel_spec = (melspec + 0.00001).log()
spec_liro(melspec.squeeze().numpy(), sr, x_is_mel_spec=True, convert_to_db=False, title="Mel Spectrogram")
spec_liro(log_mel_spec.squeeze().numpy(), sr, x_is_mel_spec=True, convert_to_db=False, title="Log Mel Spectrogram")

What to do with a log mel spectrogram?¶

Use your favorite vision architecture and treat the log mel spectrogram as an image with a single input channel.
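
A minimal sketch of this idea with a randomly initialized toy CNN (layer sizes are illustrative placeholders): the spectrogram of shape (n_mels, time) becomes a 4D tensor (batch, 1, n_mels, time):

```python
import torch

log_mel = torch.randn(80, 1000)          # dummy log mel spectrogram (n_mels, time)
x = log_mel.unsqueeze(0).unsqueeze(0)    # -> (batch=1, channels=1, 80, 1000)

cnn = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),       # global pooling over frequency and time
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),             # e.g., 10 output classes
)
logits = cnn(x)
print(logits.shape)                      # torch.Size([1, 10])
```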

Example Pipeline on GitHub¶

The example ML4Audio pipeline demonstrates the following points based on 200 wav files:

  • Dataset loading, PyTorch Dataset class, PyTorch Dataloader
  • Audio Signal processing routine that we discussed today
  • How to use a PyTorch Model (CNN) to generate predictions based on a log mel spectrogram
  • Simple data augmentation techniques (masking time frames, masking frequency bands, mixup, time rolling)
  • A training loop implemented with PyTorch Lightning
  • Logging implemented with Weights and Biases
  • Some of the best practices we discussed

Your next tasks¶

Until 02.04.25 (one week)¶

Prepare a short presentation (10 minutes) introducing the baseline system for your task.

  • The baseline code will be available by the end of this week or early next week for Tasks 1, 6, AES QVIM.
  • For Task 4, the baseline may be released sometime next week. If your baseline is not yet available, prepare your presentation based on a relevant related system, and include the same key aspects listed on the next slides — as if it were your baseline.

Until 02.04.25 (one week)¶

  • Aspects to look into:
    • What datasets is the baseline trained on?
    • What kind of input representation is used (e.g., log-mel spectrogram, waveform)?
    • Are there any data augmentation techniques (e.g., mixup, SpecAugment, noise injection)?
    • What neural network architecture is used?
    • What loss function is used?
    • What evaluation metric is used?
    • On what validation/test split is performance reported?
    • How long does it take to reproduce the baseline system?
    • Are there any class imbalances or domain shifts that the baseline handles (or fails to handle)?
    • What optimizer and learning-rate schedule are used?
    • What are the key hyperparameters (batch size, learning rate, etc.)?
    • How far is it from the SOTA?

Until 09.04.25 (two weeks)¶

  • Prepare a presentation (10 minutes).
  • Successfully reproduce the baseline results and show us the logged metrics.
    • Let us know soon if you encounter any problems.
  • Tell us about your plans for improving the baseline over the Easter break.

Presentations¶